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Abstract. One of the major challenges providing large databases like the 
WFCAM Science Archive (WSA) is to minimize ingest times for pixel/image 
metadata and catalogue data. In this article we describe how the pipeline pro- 
cessed data are ingested into the database as the first stage in building a release 
database which will be succeeded by advanced processing (source merging, seam- 
ing, detection quality flagging etc.). To accomplish the ingestion procedure as 
fast as possible we use a mixed Python/C-|— I- environment and run the required 
tasks in a simple parallel modus operandi where the data are split into daily 
chunks and then processed on different computers. The created data files can 
be ingested into the database immediately as they are available. This flexible 
way of handling the data allows the most usage of the available CPUs as the 
comparison with sequential processing shows. 



1. Introduction 

The WFCAM Science Archive (WSAQ; Hambly et al. 2007, Collins et al. 2006) 
holds the image and catalogue data products generated by the Wide Field Cam- 
era (WFCAM) on UKIRT (United Kingdom Infrared Telescope). The data 
comprise pipeline processed multi-extension FITS files (multiframes) containing 
pixel/image and catalogue data for four detectors at one pointing. The latter 
contains all detections of stacked multiframes. The data are pipeline processed 
at the Cambridge Astronomical Survey Unit (CASU) and transferred to Edin- 
burgh where the Wide Field Astronomy Unit (WFAU) processes it for ingestion 
into the database. Since the release database contains advanced products which 
can take a lot of CPU time to produce, it is preferable to carry out the ingest pro- 
cedure as fast as possible. Another aspect is that the pixel/image and catalogue 
data need to be ingested completely before further processing is done. Another 
constraint is the uniqueness of multiframes and detections. Each multiframe has 
to have a unique identifier across the whole database and each detection must 
be unique for a given survey. 



^http://surveys. roe. ac.uk/wsa/ 
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2. The Curation Usecases 

To ingest the large amount of data into the WSA as the first stage in build- 
ing a release database, a set of Curation Usecases (CUs) have been designed. 
They are coded in a Python/C++ environment, where C++ is used where high 
performance is needed and Python to facilitate an easy to use object-oriented 
environment. Table [1] shows an overview of the ingest CUs and the average 
volume of data to be processed per observing night. 

Table 1. The ingest curation usecases 



CU task volume/observing night 

1 Data transfer from CASU to WFAU 

2 Creation of compressed images (JPEGs) 

2 Extraction, process and ingest of 

multiframe metadata 
^ Extraction, process and ingest of 

catalogue data 



In the following we will concentrate on CUs 2 to 4, as the data transfer 
(CUl) is described in detail in Bryant et al. (2008). After transfer the data are 
split into daily chunks and then processed on multiple computers. 

The following dependencies need to be observed to avoid duplicate entries 
or missing data. Each FITS file gets assigned a unique multiframe ID associating 
data across the database with its source. Also each object in a catalogue gets 
assigned a unique object ID associating data across the database with its original 
detection. To update the database with the paths to compressed images (CU2), 
the general FITS file metadata has to be ingested beforehand (CUS). And 
finally catalogue data can only be processed (CU4) if the corresponding image 
metadata is available (CUS). 

3. Parallelization 

The normal procedure of executing the CUs for small amounts of data would 
comprise of a run of each CU followed by an ingest as shown in Figure [TJ This 
might then be followed again by the whole procedure for the next batch of data. 
As the creation of compressed images (CU2) itself can be done independently, 
it can be decoupled from image and catalogue data processing. Since this pro- 
cedure is linear, unique IDs for pixel files and catalogue detections are applied 
during CUS and CU4, respectively. 

To improve CPU usage and speed up the process for a whole cycle the 
following enhancements were applied: 



- 80 GByte 
~ 12000 JPEGs 

~ SOOO files 

~ S • 10^ detections 
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Figure 1. Non-parallel CU execution. The creation of the unique identifiers 
is indicated by the boxes denoted mfTD (multiframe ID) and objID (object 
ID), respectively. 



1. Files get multiframe IDs depending on the largest mulitframe ID in the 
database directly after transfer (GUI). The database is accordingly up- 
dated. 

2. The CU4 object ID is turned into a temporary negative, only per day 
unique ID. This way we avoid overlaps with already in the database ex- 
isting (positive) object IDs. It can be translated into a global unique ID 
directly after ingest with a simple addition to the last maximal object ID. 

3. The ingest process is completely de-coupled from extraction and process- 
ing. 

The first two steps allow us to process metadata simultaneously on different 
computers. The last one allows the ingest to be run on available ingestable data 
while new data is processed. Since ingest can take as much time as processing, 
this can nearly halve the run time of the full cycle. Figure [2] shows the flow 
chart for the final design. 



4. Automation 

A data daemon has been created that checks the available computers, their load, 
and the tasks that need to be executed as well as tasks already running. At the 
moment it suggests the distribution of these tasks to enhance usage of CPUs 
on the pixel servers. The database operator then takes the final decision in 
running the tasks. In addition to the data daemon the ingester daemon can 
be run, checking automatically for ingestable data and ingesting them into the 
database. This maximises CPU usage on the database server. 

Since these daemons can be run automatically, we are investigating the 
best way to run individual tasks remotely from a master computer. A number of 
Python-related solutions exist and are described in the Python Parallel Processing Wiki 



^http: / /wiki. python.org/moin/ParallelProcessing 
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Figure 2. Parallelized CU execution. The creation of the unique identifiers 
is indicated by the boxes denoted mflD (multiframe ID) and objID (object 
ID), respectively. 



5. Results 

Running each of the CUs 2 to 4 for 30 nights of data on five computers at the 
same time and ingesting the data as soon as it was available improved the total 
run times as follows: 

• CU2: up to 4x faster; 

• CUS: up to 3 X faster; 

• CU4: up to 2 X faster. 

The differences are due to the different ratios of time needed to process and ingest 
the data. During normal operations the gain is about 30-40% for concurrent runs 
of all CUs. 
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